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Abstract 

We first explore methods for approximating the commute time and Katz score 
between a pair of nodes. These methods are based on the approach of matrices, mo- 
ments, and quadrature developed in the numerical linear algebra community. They 
rely on the Lanczos process and provide upper and lower bounds on an estimate of 
the pair-wise scores. We also explore methods to approximate the commute times 
and Katz scores from a node to all other nodes in the graph. Here, our approach for 
the commute times is based on a variation of the conjugate gradient algorithm, and 
it provides an estimate of all the diagonals of the inverse of a matrix. Our technique 
for the Katz scores is based on exploiting an empirical localization property of the 
Katz matrix. We adopt algorithms used for personalized PageRank computing to 
these Katz scores and theoretically show that this approach is convergent. We eval- 
uate these methods on 17 real world graphs ranging in size from 1000 to 1,000,000 
nodes. Our results show that our pair-wise commute time method and column-wise 
Katz algorithm both have attractive theoretical properties and empirical perfor- 
mance. 



1 Introduction 

Commute times [Gobel and Jagers, 1974] and Katz scores [Katz, 1953] are two topo- 
logical measures defined between any pair of vertices in a graph that capture their re- 
lationship due to the link structure. Both of these measures have become important 
because of the their use in social network analysis as well as applications such as link 
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prediction [Libcn-Nowcll and Kleinberg, 2003], anomalous link detection [Rattigan and 
Jensen, 2005], recommendation [Sarkar and Moore, 2007], and clustering [Saercns et al., 



For example, Liben-Nowell and Kleinberg [2003] identify a variety of topological 
measures as features for link prediction: the problem of predicting the likelihood of 
users/entities forming new connections in the future, given the current state of the net- 
work. The measures they studied fall into two categories - neighborhood-based measures 
and path-based measures. The former are cheaper to compute, yet the latter are more ef- 
fective at link prediction. Katz scores were among the most effective path-based measures 
studied by Liben-Nowell and Kleinberg [2003] , and the commute time also performed well. 

Standard algorithms to compute these measures between all pairs of nodes are often 
based on direct solution methods and require cubic time and quadratic space in the 
number of nodes of the graph. Such algorithms are impractical for modern networks 
with millions of vertices and tens of millions of edges. We explore algorithms to compute 
a targeted subset of scores that do scale to modern networks. 

Katz scores measure the affinity between nodes via a weighted sum of the number of 
paths between them. Formally, the Katz score between node i and j is 



where paths^a;, y) denotes the number of paths of length I between i to j and a < 1 is an 
attenuation parameter. Now, let A be the symmetric adjacency matrix, corresponding to 
a undirected and connected graph, and recall that (A )j j is the number of paths between 
node i and j. Then computing the Katz scores for all pairs of nodes is equivalent to the 
following computation: 



Herein, we refer to K as the Katz matrix. We shall only study this problem when I — aA 
is positive definite, which occurs when a < l/cr max (A) and also corresponds to the case 
where the series expansion converges. 

In order to define the commute time between nodes, we must first define the hitting 
time between nodes. Formally, the commute time between nodes is defined as the sum 
of hitting times from i to j and from j to i, and the hitting time from node i to j is 
the expected number of steps for a random walk started at i to visit j for the first time. 
The hitting time is computed via first-transition analysis on the random walk transition 
matrix associated with a graph. To be precise, let A, again, be the symmetric adjacency 
matrix. Let D be the diagonal matrix of degrees: 
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Table 1 - Notation 



A the symmetric adjacency matrix for a connected, undirected graph 

D the diagonal matrix of node degrees 

n the number of vertices in A 

e the vector of all ones 

&i a vector of zeros with a 1 in the ith position 

L the combinatorial Laplacian matrix of a graph, L = D — A 

L the adjusted combinatorial Laplacian, L = L + -^ee T 

a the damping parameter in Katz 

K the Katz matrix, K = (I — a A) -1 

C the commute time matrix 

Z a "general" matrix, usually I — aA or L 

The random walk transition matrix is given by P = D~ 1 A. Let Hi j be the hitting time 
from node i and j. Based on the Markovian nature of a random walk, Hij must satisfy: 

= 1 + ^2 H i>v P v j and = 0. 

V 

That is, the hitting time between i and j is 1 more than the hitting time between i and 
v, weighted by the probability of transitioning between v and j, for all v. The minimum 
non-negative solution H that obeys this equation is thus the matrix of hitting times. 
The commute time between node i and j is then: 

As a matrix C = H + if T , and we refer to C as the commute time matrix. An 
equivalent expression follows from exploiting a few relationships with the combinato- 
rial graph Laplacian matrix: L = D — A [Fouss et al., 2007]. Each element = 
Vol(G)(it . - 2L\j + L] tj ) where Vol(G) is the sum of elements in A and Lr is the 
pseudo- inverse of L. The null-space of the combinatorial graph Laplacian has a well 
known expression in terms of the connected components of the graph G. This relation- 
ship allows us to write 

L ] = (L + -ee T y 1 - -ee T 

n n 

^ v ' 

L 

for connected graphs [Saerens et al., 2004], where e is the vector of all ones, and n 
is the number of nodes in the graph. The commute time between nodes in different 
connected components is infinite, and thus we only need to consider connected graphs. 
We summarize the notation thus far, and a few subsequent definitions in Table 1. 

Computing either Katz scores or commute times between all pairs of nodes involves 
inverting a matrix: 

(I-aA)- 1 or (L+-ee T )- 1 . 
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Standard algorithms for a matrix inverse require 0(n 3 ) time and 0(n 2 ) memory. Both 
of these requirements are inappropriate for a large network (see Section 2 for a brief 
survey of existing alternatives). Inspired by applications in anomalous link detection 
and recommendation, we focus on computing only a single Katz score or commute time 
and on approximating a column of these matrices. In the former case, our goal is to 
find the score for a given pair of nodes and in the latter, it is to identify the most 
related nodes for a given node. In our vision, the pair-wise algorithms should help 
in cases where random pair-wise data is queried, for instance when checking random 
network connections, or evaluating user similarity scores as a user explores a website. 
For the column-wise algorithms, recommending the most similar nodes to a query node 
or predicting the most likely links to a given query node are both obvious applications. 

One way to compute a single score - what we term the pair-wise problem - is to find 
the value of a bilinear form: 

u T Z-\, 

where Z = (I — a A) or Z — L. An interesting approach to estimate these bilinear 
forms, and to derive computable upper and lower bounds on the value, arises from the 
relationship between the Lanczos/Stieltjes procedure and a quadrature rule [Golub and 
Mcurant, 1994]. This relationship and the resulting algorithm for a quadratic form 
(u T Z _1 u) is described in Section 4.1. Prior to that, and because it will form the basis of 
a few algorithms that we use, Section 3 first reviews the properties of the Lanczos method. 
We state the pairwise procedure for commute times and Katz scores in Section 4.2 and 4.3. 

The column-wise problem is to compute, or approximate, a column of the matrix C 
or If. A column of the commute time matrix is: 

c, = Ce { = vol(G)[(ei - e v ) T L (e* - e„) : 1 < v < n]. 

A difficulty with this computation is that it requires all of the diagonal elements of L , 
as well as the solution of the linear system L e;. We can use a property of the Lanczos 
procedure and its relationship with the conjugate gradient algorithm to solve L and 
estimate all of the diagonals of the inverse simultaneously [Paige and Saunders, 1975, 
Chantas et al., 2008]. 

A column of the Katz matrix is Kei, which corresponds to solving a single linear 
system: 

ki = Kei = (I — aij^'e,: — ej. 

Empirically, we observe that the solutions of the Katz linear system are often localized. 
That is, there are only a few large elements in the solution vector, and many negligible 
elements. See Table 2 for an example of this localization over a few graphs. In order 
to capitalize on this phenomenon, we use a generalization of "push" -style algorithms 
for personalized PageRank computing [McSherry, 2005, Andersen et al., 2006, Berkhin, 
2007]. These methods only access the adjacency information for a limited number of 



4 



vertices in the graph. In Section 5.2, we explain the generalization of these methods, 
the adaptation to Katz scores, and utilize the theory of coordinate descent optimization 
algorithms to establish convergence. As we argue in that section, these techniques might 
also be called "Gauss-Southwell" methods, based on historical precedents. 

One of the advantages of Lanczos-based algorithms is that the convergence is of- 
ten much faster than a worst-case analysis would suggest. This means studying their 
convergence by empirical means and on real data sets is important. We do so for 17 
real- world networks in Section 6, ranging in size from approximation 1,000 vertices to 
1,000,000 vertices. These experiments highlight both the strengths and weaknesses of our 
approaches, and should provide a balanced picture of our algorithms. In particular, our 
algorithms run in seconds or milliseconds - significantly faster than many techniques that 
use preprocessing to estimate all of the scores simultaneously, which can take minutes. 

Straightforward approaches based on the conjugate gradient technique are often com- 
petitive with our techniques. However, our algorithms have other desirable properties, 
such as upper and lower bounds on the solution or exploiting sparsity in the solution 
vector, which conjugate gradient does not. These experiments also shed light on a recent 
result from von Luxburg ct al. [2010] on the relationship between commute time and the 
degree distribution. 

Literature directly related to the problems we study and the techniques we propose 
is discussed throughout the paper, in context. However, we have isolated a small set of 
core related papers and discuss them in the next section. 

In the spirit of reproducible research, we make our data, computational codes, and fig- 
ure plotting codes available for others: http: //cs .purdue . edu/homes/dgleich/publications/ 
2011/codes/f ast-katz/. 

2 Related work 

This paper is about algorithms for computing commute times and Katz scores over 
networks with hundreds of thousands to millions of nodes. Most existing techniques 
determine the scores among all pairs of nodes simultaneously [Acar et al., 2009, Wang 
et al., 2007, Sarkar and Moore, 2007] (discussed below). These methods tend to involve 
some preprocessing of the graph using a one-time, rather expensive, computation. We 
instead focus on quick estimates of these measures between a single pair of nodes and 
between a single node and all other nodes in the graph. In this vein, a recent paper [Li 
et al., 2010] studies efficient computation of SimRank [Jch and Widom, 2002] for a given 
pair of nodes. 

A highly related paper is Benzi and Boito [2010], where they investigate entries 
in functions of the adjacency matrix, such as the exponential, using quadrature-based 
bounds. A priori upper and lower bounds are obtained by employing a few Lanczos steps 
and the bounds are effectively used to observe the exponential decay behavior of the 
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exponential of an adjacency matrix. 

In Sarkar and Moore [2007], an interesting and efficient approach is proposed for 
finding approximate nearest neighbors with respect to a truncated version of the commute 
time measure. Spiclman and Srivastava [2008] develop a technique for computing the 
effective resistance of all edges (which is proportional to commute time) in 0(m log n) 
time. Both of these procedures involve some preprocessing. 

Standard techniques to approximate Katz scores include truncating the series expan- 
sion to paths of length less than £ max [Foster et al., 2001, Wang et al., 2007] and low-rank 
approximation [Liben-Nowell and Kleinberg, 2003, Acar et al., 2009]. Only the former 
technique, when specialized to compute only a pair or top-fc set, has performance com- 
parable to our algorithms. However, when we tested an adapted algorithm based on the 
Neumann series expansion, it required much more work than the techniques we propose. 

As mentioned in the introduction, both commute times and Katz scores were studied 
by Liben-Nowell and Kleinberg [2003] for the task of link prediction, and were found 
to be effective. Beyond link prediction, Yen et al. [2007] use a commute time kernel 
based approach to detect clusters and show that this method outperforms other kernel 
based clustering algorithms. The authors use commute time to define a distance mea- 
sure between nodes, which in turn is used for defining a so-called intra-cluster inertia. 
Intuitively, this inertia measures how close nodes within a cluster are to each other. The 
algorithm we propose for computing the Katz and commute time score for a given pair of 
nodes x, y extends to the case where one wants to find the aggregate score between a node 
x and a set of nodes S. Consequently, this work has applications for finding the distance 
between a point and a cluster as well as for finding intra-cluster inertia. For applica- 
tions to recommender systems, Sarkar et al. [2008] used their truncated commute time 
measure for link prediction over a collaboration graph and showed that it outperforms 
personalized PageRank [Page et al., 1999]. 

3 The Lanczos Process 

The Lanczos algorithm [Lanczos, 1950] is a procedure applied to a symmetric matrix, 
which works particularly well when the given matrix is large and sparse. A sequence of 
Lanczos iterations can be thought of as "truncated" orthogonal similarity transforma- 
tions. Given an n x n matrix Z , we construct a matrix Q with orthonormal columns, 
one at a time, and perform only a small number of steps, say k, where k <^ n. The input 
for the algorithm is the matrix Z, an initial vector q and a number of steps k. Upon 
exit, we have an n x (fc + 1) matrix with orthonormal columns and a (k + 1) x k 

tridiagonal matrix Tk+i.k, that satisfy the relationship 

ZQk — Qk+iTk+i,k, 
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Algorithm 1 Lanczos(Z, q, k). 

Qi = q/l|q||2,/3o = 0,q = 
for j = 1 to k do 
z = Zq ■ 
«j = qjz 

z = z oyq,- - /^iq^-t 
ft = N| 2 

if (3j — 0, q - +1 = and quit 
else q j+1 = z/fc 



Algorithm 2 LanczosStep(Z, q( ),q, ^ 
1: z = Zq 
2: a = q T z 

3: z = z aq — /3( _ )q( _ ) 

4: /3 = ||z|| 2 

5: if /3 = 0,qW = 
6: else q(+5 = z//3 
7: return (q, a, /3) 



where Q fc is the n x k matrix that contains the first k columns of Q k+ i, and T k +i.k — 
tri(/?j, ai,0i) is tridiagonal. 

What makes the Lanczos procedure attractive are the good approximation properties 
that it has for k <C n. The matrix T k +i,k is small when k <C but the eigenvalues 
of its fc x A: upper part - a matrix we will refer to as T k in the subsequent section - 
approximate the extremal eigenvalues of the large nx n matrix Z. This can be exploited 
not only for eigenvalue computations but also for solving a linear system [Lanczos, 1952, 
Paige and Saunders, 1975]. Another attractive feature is that the matrix Z does not 
necessarily have to be provided explicitly; the algorithm only uses Z via matrix- vector 
products. 

The Lanczos procedure is given in Algorithm 1. For expositional purposes we define 
the core of the algorithm as Algorithm 2. We will later incorporate that part into other 
algorithms - see Section 4. 

4 Pairwise Algorithms 

Consider the commute time and Katz score between a single pair of nodes: 

d t j — Vol(G)(ei — ej) T L>(ei — e 3 ) and Kij — ef(I — aA)~ 1 ej — Sij. 

In these expressions, e^ and ej are vectors of zeros with a 1 in the ith and jth posi- 
tion, respectively; and Si.j is the Kronecker delta function. A straightforward means of 
computing them is to solve the linear systems 

L y = e j — e^ and (I — aA)x = ej . 

Then = Vol(G)(ei — e 3 ) T y and Ki.j = e[x — 5ij. It is possible to compute the pair- 
wise scores by solving these linear systems. In what follows, we show how a technique 
combining the Lanczos iteration and a quadrature rule [Golub and Mcurant, 1994, 1997] 
produces the pair-wise commute time score or the pair-wise Katz score as well as upper 
and lower bounds on the estimate. 
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4.1 Matrices moments and quadrature 

Both of the pairwise computations above are instances of the general problem of esti- 
mating a bilinear form: 

u T /(Z)v, 

where Z is symmetric positive definite (for Katz, this occurs by restricting the value of a, 
and for commute times, the adjusted Laplacian L is always positive definite), and fix) 
is an analytic function on the region containing the eigenvalues of Z . The only function 
f(x) we use in this paper is f(x) = -, although we treat the problem more generally for 
part of this section. 

Golub and Meurant [1994, 1997] introduced elegant computational techniques for 
evaluating such bilinear forms. They provided a solid mathematical framework and a 
rich collection of possible applications. These techniques are well known in the numer- 
ical linear algebra community, but they do not seem to have been used in data mining 
problems. We adapt this methodology to the pairwise score problem, and explain how to 
do so in an efficient manner in a large scale setting. The algorithm has two main compo- 
nents, combined together: Gauss-type quadrature rules for evaluating definite integrals, 
and the Lanczos algorithm for partial reduction to symmetric tridiagonal form. In the 
following discussion, we treat the case of u = v. This form suffices thanks to the identity 

u T /(Z)v = \[(u + v) T /(Z)(u + v) - (u - v) T /(Z)(u - v)] . 

Because Z is symmetric positive definite, it has a unitary spectral decomposition, 
Z = QAQ T , where Q is an orthogonal matrix whose columns are eigenvectors of Q with 
unit 2-norms, and A is a diagonal matrix with the eigenvalues of Q along its diagonal. 
We use this decomposition only for the derivation that follows, it is never computed in 
our algorithm. Given this decomposition, for any analytic function /, 

n 

u T /(Z)u = u T Q/(A)Q T u - J2 K X ^l 

i=l 

where u = Q T u. The last sum is equivalent to the Stieltjes integral 

u T /(Z)u = | /(A)du,(A). (1) 

Here w(A) is a piecewise constant measure, which is monotonically increasing, and its 
values depend directly on the eigenvalues of Z. Both A and A are values that are lower 
and higher than the extremal eigenvalues of Z , respectively. Let 

< Ai < A 2 < . . . < A„ 
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be the eigenvalues of Z. Note that A < Ai and A > A„. Now, oj(A) takes the following 
form: 



The first of Golub and Meurant's key insights is that we can compute an approxima- 
tion for an integral of the form (1) using a quadrature rule, e.g., 



where r)j,0Jj are the nodes and weights of a Gaussian quadrature rule. The second 
insight is that the Lanczos procedure constructs the quadrature rule itself. Since we use 
a quadrature rule, an estimate of the error is readily available, see for example Davis and 
Rabinowitz [1984]. More importantly, we can use variants of the Gaussian quadrature to 
obtain both lower and upper bounds and "trap" the value of the element of the inverse 
that we seek between these bounds. 

The ability to estimate bounds for the value is powerful and provides effective stop- 
ping criteria for the algorithm - we shall see this in the experiments in Section 6.2. It 
is important to note that such component-wise bounds cannot be easily obtained if we 
were to extract the value of the element from a column of the inverse, by solving the 
corresponding linear system, for example. Indeed, typically for the solution of a linear 
system, norm-wise bounds are available, but obtaining bounds pertaining to the compo- 
nents of the solution is significantly more challenging and results of this sort are harder 
to establish. It should also be noted that bounds of the sort discussed here cannot be 
obtained for general non-symmetric matrices. 

Returning to the procedure, let /(A) be a function where the (2k + l)st derivative 
has a negative sign for all A < A < A. Note that /(A) = 1/A satisfies this condition 
because all odd derivatives are negative when A > 0. As a high level algorithm, the 
Golub-Meurant procedure for estimating bounds 



is given by the following steps. 

1. Let a = ||u|| . 

2. Compute T k from k steps of the Lanczos procedure applied to Z and u/cr. 

3. Compute T fe , which the matrix 7\. extended with another row and column crafted 
to add the eigenvalue A to the eigenvalues of TV This new matrix is still tridiagonal. 

4. Set b = a 2 e{ f(T_ k )ei. This estimate corresponds to a (k + l)-point Gauss-Radau 
rule with a prescribed point of A. 




A < Ai 

Ai < A < A s:+ i 
E"=i A „ < A- 




b < u T /(Z)u < b 
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5. Compute Tk, which the matrix Tk extended with another row and column crafted 
to add the eigenvalue A to the eigenvalues of Tk- Again, this new matrix is still 
tridiagonal. 

6. Set b = cr 2 ef /(Tfc)ei. This estimate corresponds to a (fc + l)-point Gauss- Radau 
rule with a prescribed point of A. 

Based on the theory of Gauss- Radau quadrature, the fact that these are lower and upper 
bounds on the quadratic form u T f(Z)u follows because the sign of the error term changes 
when prescribing a node in this fashion. See Golub and Meurant [2010, Theorem 6.4] for 
more information. As fc increases, the upper and lower bounds converge. 

While this form of the algorithm is convenient for understanding the high level prop- 
erties and structure of the procedure, it is not computationally efficient. If /(A) = 1/A 
and if we want to compute a more accurate estimate by increasing fc, then we need to 
solve two inverse eigenvalue problems (steps 3 and 5), and solve two linear systems (steps 
4 and 6). Each of these steps involves O(k) work because the matrices involved are tridi- 
agonal. However, a constant-time update procedure is possible. The set of operations 
to efficiently update b and b after a Lanczos step (Algorithm 2) is given by Algorithm 3. 
Please see Golub and Meurant [1997] for an explanation of this procedure. Using Algo- 
rithms 2 and 3 as sub-routines, it is now straightforward to state the pairwise commute 
time and Katz procedures. 

4.2 Pairwise commute scores 

The bilinear form that we need to estimate a commute time is 

b=(e i -e j ) T L 1 (e i -e j ). 

For this problem, we apply Algorithm 2 to step through the Lanczos process and then 
use Algorithm 3 to update the upper and lower bounds on the score. This combination 
is explicitly described in Algorithm 4. Note that we do not need to apply the final 
correction with ^ee T because e T (ei — e^) = 0. 

4.3 Pairwise Katz scores 

The bilinear form that we need to estimate for a Katz score is 

b = e[(I-aA)- 1 e 1 . 

Recall that we use the identity: 

b = - [(e< + ejfil - aA)- 1 ^ + ej) - (e. t - e,) 7 '(I - aAy 1 ^ - e,)] 
^ N v ' v v ' 

=9 =h 

In this case, we apply the combination of LanczosStep and MMQStep to estimate g < 
g <g and h< h <h. Then \(g_ - h) < b < \(g - h). 
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Algorithm 3 MMQStep [Golub and Meurant, 1997, Algorithm GQL] 



Input: a, /3, 6_i, c_i, d-i, d__i, d_i 



2: d = a- A- Z=±; d = a-\-^- 

3: Uj = A + C; w = A+ ^ 

Output: (6,6) and (6, c, d, d, d) 



Algorithm 4 Pairwise Score Bounds for commute time 

Input: 1/ (Laplacian matrix); i, j (pairwise coordinate); A, A (bounds where A < \(L) < 

A); t (stopping tolerance) 
Output: k,k where k < (e, — ej) T L\ei — e^) < k 

h (Initialize Lanczos) a — \/2, q _ ± = 0, q = (e, — ej)/a, /3q = 

2: (Initialize MMQStep) 6 = 0,c = l,d = l,d = l,d = 1 

3: for j = 1, ... do 

4: Set (q^-, ctj, ft j) from LanczosStep(L, q,,_2, 
5: Set (6, 6) and (6j , Cj , dj , dj , dj ) from 

MMQStep(a J -,/3 J _i,/3j-,6j_i,c J _i,dj_i,d J _ 1 ,dj_i). 
6: k = a 2 b; k = a 2 b 

7: if K — K < T, Stop 

Algorithm 5 Pairwise Score Bounds for Katz 

Input: A (adjacency matrix); a (the Katz damping factor); i,j (pairwise coordinate); 
A, A (bounds where A < A(J — a A) < A); r (stopping tolerance) 

Output: p,p where p < (I — aA)^j < p 
h (Initialize Lanczos for g) a = \/2, q_ x = 0, q — (ej + ej)/a, /3q = 
2: (Initialize Lanczos for h) u i = 0, Uo = (ej — ej)/a, /3q = 
3: (Initialize MMQStep for g) b 9 = 0,eg = l,dg = l,dg = l,dj = 1 
4: (Initialize MMQStep for h) 6g = 0,c§ = l,d§ = 1,4 = l,d£ = 1 
5: for j = 1, ... do 

6: Set (q^-, a 9 , /3|) from LanczosStep((J — aA), q_y_ 2 , Qj-i> 
7: Set (uj, a^, /?j*) from LanczosStep((J — aA), Uj_2, Uj_i, 
8: Set (<?,£) and (6|, cj, d?, d^-, d|) from 

MMQStep(oj , t , /3| , 6f _ x , cf_ x , df_ x , dj _ t , d*_ x ). 

9: Set (h, h) and (&*, e£, d£, dj, d£) from 

MMQSte P (a^_ 1 , $ , b)_ x , c£_ 1; dj_ x , d*_i , dj_ x ) . 
10: p = ( T 2 /4(g-^);p- ( 7 2 /4(77 -/i) 

11: if p — p < T, Stop 
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5 Column-wise algorithms 



Whereas the last section used a single procedure to derive two algorithms, in this section, 
we investigate two different procedures: one for commute time and a different procedure 
for Katz scores. The reason behind this difference is that, as mentioned in the introduc- 
tion, computing a column of the commute time matrix cannot be stated as the solution 
of a single linear system: 

Cj = Ce, = vol(G)[(e l - e v ) T L (e* - e„) : 1 < v < n]. 

Computing this column requires all of the diagonal elements of the inverse. In contrast, 
a column of the Katz matrix is just the solution of a linear system: 

kj = Ke t = (I - aA)~ 1 e i - e^. 

For this computation, we exploit an empirical localization property of these columns. 

5.1 Column-wise commute times 

A straightforward way to compute an entire column of the commute time matrix would 
require solving n separate linear systems: one to get both L et and L i i , and the other 
n — 1 to get Lj j for i =/= j. Neither solving each system independently, nor using a multiple 
right hand side algorithm [O'Leary, 1980], will easily yield an efficient procedure. Both 
of these approaches generate far too much extraneous information. In fact, we only need 
one linear system solve, and the diagonal elements of the pseudo-inverse. Thus, any 
procedure to compute or estimate diag(-L^) provides a tractable algorithm. 

One such procedure arises, again, from the Lanczos method. It was originally de- 
scribed by Paige and Saunders [1975], and is explained in more detail in Chantas et al. 
[2008]. Suppose we want to compute diag(i ). If the Lanczos algorithm runs to com- 
pletion in exact arithmetic, then we have: 

L = QTQ T and L 1 = QT X Q T . 

Let T = RR T be a Cholesky factorization of T. If we substitute this factorization into 
the expression for the inverse, then L = VR R V . Now, let W — VR~ . Note 

1 rp 

that L = WW . As a notational convenience, let w> be the kth column of W. 
Consequently, 

n 

diag( J E _1 ) = ^ w fe o w fc 

k=l 

where o is the Hadamard (element-wise) product: [w^ o Wfc]j = w^. If we 
implement CG based on the Lanczos algorithm as explained in Paige and Saunders [1975], 
then the vector is computed as part of the standard algorithm, and is available at 
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no additional cost. This idea is implemented in the cgLanczos .m code [Saunders, 2007], 
which we use in our experiments. Please see Chantas et al. [2008] for a detailed account 
of this derivation including the diagonal estimate. 

Based on advice from the author of the cgLanczos code, we added local reorthogonal- 
ization to the Lanczos procedure. This addition requires a few extra vectors of memory, 
but ensures greater orthogonality in the computed Lanczos vectors q fc . Also, based on 
advice from the author, we use the following preconditioned linear system: 



If f is the estimate of the diagonals of {D~ 1 I 2 LD~ 1 / 2 )~ 1 , then D~ 1 { is the estimate of 
the diagonals of L . Using this preconditioned formulation, the algorithm converged 
much more quickly than without preconditioning. In summary, this approach to estimate 
the column- wise commute times Cj is: 

1. Solve D^ 1 ! 2 LD~ 1 ^ 2 y = D~ 1 ^ 2 ei using cgLanczos .m 



to get both y and f « diag ((D~ 1/2 LD- 1 / 2 )- 1 ) . 

2. Set x = D~ 1/2 y - ±e L^e,. 

3. Set g = D H - ±e « diagCL 1 '). 

4. Output Ci « g + Xie — 2x. 

We refrain from stating this as a formal algorithm because the majority of the work is 
in the cgLanczos .m routine. 

5.2 Column- wise Katz scores 

In this section, we show how to adapt techniques for rapid personalized PageRank com- 
putation [McSherry, 2005, Andersen et al., 2006, Bcrkhin, 2007] to the problem of com- 
puting a column of the Katz matrix. Recall that such a column is given by the solution 
of a single linear system: 



The algorithms for personalized PageRank exploit the graph structure by accessing the 
edges of individual vertices, instead of accessing the graph via a matrix-vector product. 
They are "local" because they only access the adjacency information of a small set of 
vertices and need not explore the majority of the graph. Such a property is useful when 
the solution of a linear system is localized on a small set of elements. 

Localization is a term with a number of interpretations. Here, we use it to mean that 
the vector becomes sparse after rounding small elements to 0. A nice way of measuring 
this property is to look at the participation ratios [Farkas et al., 2001]. Let k be a column 
of the Katz matrix, then the participation ratio of k is 



D -i/*lD-i/3 y = D-^ 2 e t . 



ki = Ke t = (I — aA) 1 e i 



P = 
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Table 2 - Participation ratios for Katz scores. These results demonstrate that the columns of 
the Katz matrix are highly localized. In the worst case, there are only a few thousand large 
elements in a vector, compared with the graph size of a few hundred thousand vertices. 



Graph Vertices Avg. Deg. Participation Ratios 









Min 


Mean 


Median 


Max 


tapir 


1 AO A 

1U24 


r £t 
0.0 


1 o 

4.2 


ion 
12. U 


1 1 o 

11.0 


30.8 


stanford-cs-sym 


2 ( 09 


1 .4 


1.0 


Of? O 

2b. 3 


oo c 
23.0 


2 (4.1 


ca-GrQc 


4158 


6.5 


1.0 


27.4 


34.0 


84.2 


wiki-Vote 


7066 


28.5 


1.2 


248.8 


291.6 


342.6 


ca-HepTh 


8638 


5.7 


1.0 


23.5 


29.8 


82.1 


ca-HepPh 


11204 


21.0 


1.0 


160.7 


256.1 


268.5 


Stanford3 


11586 


98.1 


1.1 


1509.5 


1657.8 


1706.4 


ca-AstroPh 


17903 


22.0 


1.0 


167.5 
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This ratio measures the number of effective non-zeros of the vector. If k is a uniform 
vector, then p = n, the size of the vector. If k has only a single element, then p = 1, the 
number of states occupied. For a series of graphs we describe more formally in Section 6.1, 
we show the statistics of some participation ratios in Table 2. We pick columns of the 
matrix in two ways: (i) randomly and (ii) from the degree distribution to ensure we 
choose both high, medium, and low degree vertices. See Section 6.7 for a more formal 
description about how we pick columns; we use the "hard alpha" value of Katz described 
in the experiments section. The results show that number of effective non-zeros is always 
less than 10,000, even when the graph has 1,000,000 vertices. Usually, it is even smaller. 
Our forthcoming algorithms exploit this property. 

The basis of these personalized PageRank algorithms is a variant on the Richardson 
stationary method for solving a linear system [Varga, 1962]. Given a linear system 
Zx = b, the Richardson iteration is 

x (fe+i) =X W + r W ) 

where r^ fc ' = b — Zx^ is the residual vector at the fcth iteration. While updating 
x (fe+i) j g a i mear ti m e operation, computing the next residual requires another matrix- 
vector product. To take advantage of the graph structure, the personalized PageRank 
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algorithms [McSherry, 2005, Andersen ct al., 2006, Berkhin, 2007] propose the following 
change: do not update x( fc+1 ' with the entire residual, and instead change only a single 
component of x. Formally, x' fc+1 ) = x( fe ) +rj fc ' ) ej, where ej is a vector of all zeros, except 

(k) 

for a single 1 in the jth position, and rj is the jth component of the residual vector. 
Now, computing the next residual involves accessing a single column of the matrix Z: 

r (fc+D = b - Zx( fc+1 ) = b Z(x« + r^ej) = r« + rf Ze 3 . 

Suppose that r, x, and Ze.j are sparse, then this update introduces only a small number of 
new nonzeros into both x and the new residual r. If Z = (I — a A), as in the case of Katz, 
then each column is sparse, and thus keeping the solution and residual sparse is a natural 
choice for graph algorithms where the solution x is localized (i.e., many components of 
x can be rounded to without dramatically changing the solution). By choosing the 
element j based on the largest entry in the sparse residual vector (maintained in a heap) , 
this algorithm often finds a good approximation to the largest entries of the solution 
vector x while exploring only a small subset of the graph. The resulting procedure is 
presented in Algorithm 6. For reasons that will become clear below, we call this procedure 
the Gauss-Southwell algorithm. When experimenting with this method, we found that 
picking elements from the heap proportional to D~ 1 r instead of r yielded convergence 
with fewer total edges explored, mirroring the results in Andersen ct al. [2006]. We use 
this version in all of our experiments, although we state all the formal convergence results 
for the simple choice of residual r. 

Algorithm 6 Column-wise Katz scores (via the Gauss- Southwell algorithm) 

Input: A (the adjacency matrix), a (the Katz damping factor), i (the desired column), 

t (a stopping tolerance). 
Output: x (an approximate solution of (I — aA)~ 1 ei) 
l: Set x = 0, r = 

2: Let H be a heap over the non-zero entries of r larger than r. 

3: Set n = l, update W. 

4: while T~L is not empty do 

5: Set j as the index of the largest element in H 
6: if Tj < t then quit. 

7: Tj = Tj 

8: Xj Xj + T) 

9: r.j <— 0, remove j from % 

10: for u where Aj tU > do 
li: r u <- r u + arj 

12: if r u > t then insert j in H or update H. 
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Let d max be the maximum degree of a node in the graph, then each iteration takes 
0(d max logn) time. We analyze the convergence of this algorithm for Katz scores in 
two stages. In the first case, when a < l/(i max , then the convergence theory of this 
method for personalized PageRank also shows that it converges for Katz scores. This 
fortunate occurrence results from the equivalence of Katz scores and the general formu- 
lation of PageRank adopted by McSherry [2005] in this setting. In the second case, when 
a < l/cr max (A), then (J — a A) is still symmetric positive definite, and the Richardson 
algorithm converges. To show convergence in this case, we will utilize an equivalence 
between this algorithm and a coordinate descent method. 

For completeness, we show a precise convergence result when a < l/d max . The key 
observation here is that the residual r is always non-negative and that the sum of the 
residual (e T r) is monotonically decreasing. To show convergence, we need to bound this 
sum by a function that converges to 0. 

Consider the algorithm applied to (I — aA)x = e,. From step k to step k + 1, the 
algorithm sets 

x (fc+i) = X W + ve .. r (fe+i) = r « + ^(j _ a A)e 3 . 

First note that a < l/<i max implies A k+1 ^ > given rf^ > 0. This bound now implies 
that xf +1) > when xf ) > 0. Since these conditions hold for the initial conditions, 
x(°) = and r^ ^ = e q , they remain true throughout the iteration. Consequently, we can 
use the sum of as the 1-norm of this vector, that is, e T r^ fc+1 - ) = ||r ( - fe+1 )|| 1 . It is now 
straightforward to analyze the convergence of this sum: 

e T r (fc+i) _ e T r (fc) _ rj + arje T Aei. 

At this point, we need the bound that rj — > {1 / n)e T r^ k \ which follows immediately 
from the fact that is the largest element in r( fc ). Also, e T Aei < d max . Thus, we 
conclude: 

REMARK 1 If a < l/d max , then the 1-norm of the residual in the Gauss-Southwell itera- 
tion applied to the Katz linear system satisfies 

HrO+i)^ <^1- 1 ~ admarr ^|| r (fc)|| i <(l- 1 ~ admax \ 

In the second case, when l/cf max < a < l/cr max (A), then the Gauss-Southwell itera- 
tion in Algorithm 6 still converges, however, the result is more intricate than the previous 
case because the sum of the residual does not converge monotonically. As we shall see, 
treating this linear system as an optimization problem provides a way to handle this case. 
Let Z be symmetric positive definite. We first show that the Gauss-Southwell algorithm 
is a coordinate descent method for the convex problem 

minimize ^x T .Zx — x T b = /(x). 
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The gradient of this problem is 2Tx — b, hence a stationary point is the solution of the 
linear system, and the global minimizer. In this framework, the Richardson method is a 
gradient descent method. Let gW be the gradient at step k, = Zx^' — b then 

X (fe+D = x « _ g 

is exactly the Richardson step. 

Now consider a standard coordinate descent method. Such methods usually minimize 
the function in the jth coordinate exactly. Formally, they find 

X (fc+D =x (fe) +7 W ej 

where 

7 (fc) = argmin 7 /(x (fc) + jej). 
Solving this system produces the choice 

Note that in terms of the optimization problem the Gauss-Southwell algorithm generates 

7 W = rf > = ( bj - zJxW). 
The two methods are equivalent if the diagonals of A are 1. Consequently, we have: 

LEMMA 2 The Gauss- Southwell method for Zx = h with Z+j — 1 is equivalent to a 
coordinate gradient descent method for the function /(x) = (l/2)x T .Zx — x T b. 

To produce a convergent algorithm, we must now specify how to choose the descent 
direction j. 

THEOREM 3 Let Z be symmetric positive definite with Zij = 1. Then the Gauss- 
Southwell method for Zx = h and — argmaxjr^ | or with chosen cyclically 
(j = 1; j^ k+1 ' = j( k ' + 1 mod n) is convergent. 

Proof This result follows from the convergence of the coordinate descent method [Luo 
and Tseng, 1992, Theorem 2.1] with these two update rules. The first is also known as 
the Gauss- Southwell rule. ■ 

This proof demonstrates that, as long as Ai,i = for all the diagonal entries of the 
adjacency matrix, then Algorithm 6 will converge when (I — a A) is positive definite, that 
is, when a < l/er max (A). We term this algorithm a Gauss-Southwell procedure because 
the choice of j in the algorithm is given by the Gauss-Southwell rule. 
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6 Experimental Results 



The previous sections showed three algorithms based on the Lanczos method, and showed 
the theoretical convergence of the column-wise Katz algorithm. In this section, we inves- 
tigate these algorithms numerically. Algorithms based on the Lanczos method, in general, 
are arguably best studied empirically because their worst-case convergence properties are 
often conservative. These experiments are designed to shed light on two key questions: 

1. How do these iterative algorithms converge to the exact solution? 

2. Are the techniques faster than a conjugate gradient based algorithm? 

Note that column-wise commute time measure is a special case for reasons we discuss 
below, and we only investigate the accuracy of our procedure for that problem. 

Experimental settings We implemented our methods in Matlab and Matlab mex 
codes. All computations and timings were done in Linux on a desktop with a Core i7-960 
processor (4 core, 2.8GHz) with 24GB of memory. As mentioned in the introduction, 
all of the experimental code is available from http://cs.purdue.edu/homes/dgleich/ 
publications/2011/ codes/f ast-katz/. 

We first describe the data used in the experiments. These data were also used in the 
experiment about localization in the Katz scores from the previous section. 

6.1 Data 

We use three publicly available sources and three graphs we collected ourselves. The 
majority of the data comes from the SNAP collection [Leskovec, 2010] of which, we 
use ca-GrQc, ca-HepTh, ca-CondMat, ca-AstroPh, email-Enron, email-EuAll [Leskovec 
ct al., 2007], wiki-Vote [Leskovec et al., 2010], soc-Epinionsl [Richardson et al., 2003], and 
soc-Slashdot0811 [Leskovec et al., 2009]. Besides these, the graph tapir is from Gilbert 
and Teng [2002], the graph Stanford3 is from [Traud et al., 2011], and both graphs 
stanford-cs [Hirai et al., 2000] and holly wood-2009 [Boldi et al., 2011] are distributed via 
the webgraph framework [Boldi and Vigna, 2004]. The graph stanford-cs is actually a 
subset of the webbase-2001 graph [Hirai et al., 2000], restricted only to the pages in the 
domain cs.stanford.edu. All graphs are symmetrized (if non-symmetric) and stripped of 
any self-loops, edge weights, and extraneous connected components. 

DBLP We extracted the DBLP coauthors graph from a recent snapshot (2005-2008) 
of the DBLP database. We considered only nodes (authors) that have at least three 
publications in the snapshot. There is an undirected edge between two authors if they 
have coauthored a paper. From the resulting set of nodes, we randomly chose a sample 
of 100,000 nodes, extracted the largest connected component, and discarded any weights 
on the edges. 
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arXiv This dataset contains another coauthorship graph extracted by a snapshot (1990- 
2000) of arXiv, which is an e-print service owned, operated and funded by Cornell Uni- 
versity, and which contains bibliographies in many fields including computer science and 
physics. This graph is much denser than DBLP. Again, we extracted the largest con- 
nected component of this graph and only work with that subset. 

Flickr contacts Flickr is a popular online-community for sharing photos, with millions 
of users. The node set represents users, and the directed edges are between contacts. 
We start with a crawl extracted from Flickr in May 2006. This crawl began with a 
single user and continued until the total personalized PageRank on the set of uncrawled 
nodes was less than 0.0001. The result of the crawl was a graph with 820,878 nodes 
and 9,837,214 edges. In order to create a sub- graph suitable for our experimentation 
we performed the following steps. First, we created a graph from Flickr by taking all 
the contact relationships that were reciprocal, and second, we again took the largest 
connected component. (This network is now available from the University of Florida 
sparse matrix collection [Davis and Hu, 2010]). 

Table 3 presents some elementary statistics about these graphs. We also include the 
time to compute the truncated singular value decomposition for the first 200 singular 
values and vectors using the ARPACK library [Lehoucq et al., 1997] in Matlab's svds 
routine. This time reflects the work it would take to use the standard low-rank prepro- 
cessing algorithm for Katz scores on the network [Libcn-Nowell and Kleinberg, 2003]. 

6.2 Pairwise commute scores 

From this data, we now study the performance of our algorithm for pairwise commute 
scores, and compare it against solving the linear system Lx = (e,-— e^) using the conjugate 
gradient method (CG). At each step of CG, we use the approximation (ej — ej) T x( k \ 
where x( fe ) is the kh iterate. The convergence check in CG was either the pairwise element 
value changed by less than the tolerance, checked by taking a relative difference between 
steps, or the 2-norm of the residual fell below the tolerance. 

The first figure we present shows the result of running Algorithm 4 on a single pairwise 
commute time problem for few graphs (Figure 1). The upper row of figures show the 
actual bounds themselves. The bottom row of figures shows the relative error that would 
result from using the bounds as an approximate solution. We show the same results for 
CG. The exact solution was computed by using MINRES [Paige and Saunders, 1975] to 
solve the same system as CG to a tolerance of 10~ 10 . For all of the graphs, we used 
A = 10 -4 and A = 1 1 1 1 ]_ - Again using ARPACK, we verified that the smallest eigenvalue 
of each of the Laplacian matrices was larger than A. We chose the vertices for the pair 
from among the high-degree vertices for no particular reason. Both Algorithm 4 and CG 
used a tolerance of 10 -4 . 
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Table 3 - The networks studied in the experiments. The first five columns are self explanatory. 
The last two columns show the largest singular value of the network, which is also the matrix 
2-norm, and the time taken to compute the largest 200 singular values and vectors. 

Graph Nodes Edges Avg. Deg. Max Deg. ||A|| 2 SVD (sec.) 
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Figure 1 - Convergence results for pairwise commute times. (Top row) Each figure shows the 
upper and lower bounds at each iteration of Algorithm 4 for the graphs dblp, arxiv, flickr2, 
and hollywood-2009. (Bottom row) For the same graphs, each figure shows the relative size of 
the error, (« a lg — u C xact ) /«exact in the upper and lower bounds at each iteration. In both cases, 
we also show the same data from the conjugate gradient algorithm. See Section 6.2 for our 
discussion. 
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In the figure, the upper bounds and lower bounds "trap" the solution from above and 
below. These bounds converge smoothly to the final solution. For these experiments, the 
lower bound has smaller error, and also, this error tracks the performance of CG quite 
closely. This behavior is expected in cases where the largest eigenvalue of the matrix is 
well-separated from the remaining eigenvalues - a fact that holds for the Laplacians of our 
graphs, see Mihail and Papadimitriou [2002] and Chung et al. [2003] for some theoretical 
justification. When this happens, the Lanczos procedure underlying both our technique 
and CG quickly produces an accurate estimate of the true largest eigenvalue, which in 
turn eliminates any effect due to our initial overestimate of the largest eigenvalue. (Recall 
from Algorithm 4 that the estimate of A is present in the computation of the lower-bound 

M 

Here, the conjugate gradient method suffers two problems. First, because it does 
not provide bounds on the score, it is not possible to terminate it until the residual is 
small. Thus, the conjugate gradient method requires more iterations than our pairwise 
algorithm. Note, however, this result is simply a matter of detecting when to stop 
- both conjugate gradient and our lower-bound produce similar relative errors for the 
same work. Second, the relative error for conjugate gradient displays erratic behavior. 
Such behavior is not unexpected because conjugate gradient optimizes the A-norm of the 
solution error and it is not guaranteed to provide smooth convergence in true error norm. 
These oscillations make early termination of the CG algorithm problematic, whereas no 
such issues occur for the upper and lower bounds from our algorithm. We speculate that 
the seemingly smooth convergence behavior that we observe for the upper and lower 
bound estimates may be rooted in the convergence behavior of the largest Ritz value of 
the tridiagonal matrix associated with Lanczos, but a better understanding of this issue 
will require further exploration. 

6.3 Pairwise Katz scores 

We next show the same type of figure but for the pairwise Katz scores instead; see 
Figure 2. We use a value of a that makes J — aA nearly indefinite. Such a value 
produces the slowest convergence in our experience. The particular value we use is 
a = 1/(|| j4||2 + 1), which we call "hard alpha" in some of the figure titles. For all of 
the graphs, we again used A = 10~ 4 and A = \\L\l-j_. This value of A is smaller than the 
smallest eigenvalue of J — aA for all the graphs. Also, the vertex pairs are the same as 
those used for Figure 1. For pairwise Katz scores, the baseline approach involves solving 
the linear system (J — aA)x = ej, again using the conjugate gradient method (CG). At 
each step of CG, we use the approximation ef x' fe ' , where is the kh iterate. We use 
the same convergence check as in the CG baseline for commute time. For these problems, 
we also evaluated techniques based on the Neumann scries for J — a A, but those took 
over 100 times as many iterations as CG or our pairwise approach. The Neumann series 
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Figure 2 - Convergence results for pairwise Katz scores. (Top row.) Each figure shows the 
upper and lower bounds at each iteration of Algorithm 5 for the graphs dblp, arxiv, flickr2, and 
holly wood-2009. (Bottom row.) For the same graphs, each figure shows the relative size of the 
error, (u a i g — fcxact ) /^cxact in the upper and lower bounds at each iteration. In both cases, we 
also show the same data from the conjugate gradient algorithm. See Section 6.3 for discussion. 



is the same algorithm used in [Wang et al., 2007] but customized for the linear system, 
not matrix inverse, which is a more appropriate comparison for the pairwise case. Finally, 
the exact solution was again computed by using MINRES [Paige and Saunders, 1975] to 
solve the same system as CG to a tolerance of 10~ 14 . 

A distinct difference from the commute-time results is that both the lower and upper 
bounds converge similarly and have similar error. This occurs because of the symmetry 
in the upper and lower bounds that results from using the MMQ algorithm twice on the 
form: ( 1 /4) [(e, + ) T ( I — a A) ~ 1 (e, + ej ) — (e, — ej ) T (7 — a A) ~ 1 (e j — ej )] . In comparison 
with the conjugate gradient method, our pairwise algorithm is slower to converge. While 
the conjugate gradient method appears to outperform our pairwise algorithms here, recall 
that it does not provide any approximation guarantees. Also, the two matrix-vector 
product in Algorithm 5 can easily be merged into a single "combined" matrix-vector 
product algorithm. As we discuss further in the conclusion, such an implementation 
would reduce the difference in runtime between the two methods. 

6.4 Relative matrix-vector products in pairwise algorithms 

Thus far, we have detailed a few experiments describing how the pairwise algorithms 
converge. In these cases, we compared against the conjugate gradient algorithm for 
a single pair of vertices on each graph. In this experiment, we examine the number of 
matrix- vector products that each algorithm requires for a much larger set of vertex pairs. 
Let us first describe how we picked the vertices for the pairwise comparison. There were 
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Figure 3 - (Left) Relative performance between Algorithm 4 and conjugate gradient for pairwise 
commute times. (Right) Relative performance between Algorithm 5 and conjugate gradient 
for pairwise Katz scores. The relative performance measure is (k cg — k a \g)/k cg , where k is the 
number of matrix-vector products taken by each approach. 



two types of vertex pairs chosen: purely random, and degree-correlated. The purely 
random choices are simple: pick a random permutation of the vertex numbers, then use 
pairs of vertices from this ordering. The degree correlated pairs were picked by first 
sorting the vertices by degree in decreasing order, then picking the 1st, 2nd, 3rd, 4th, 
5th, 10th, 20th, 30th, 40th, 50th, 100th,. .. vertices from this ordering, and finally, use 
all vertex pairs in this subset. Note that for commute time, we only used the 1st, 5th, 
10th, 50th, 100th,. . . . vertices to reduce the total computation time. For the pairwise 
commute times, we used 20 random pairs, and used 100 random pairs for pairwise Katz 
scores. 

In Figure 3, we show the matrix- vector performance ratio between our pair- wise 
algorithms and conjugate gradient. Let k cg be the number of matrix-vector products 
until CG converges to a tolerance of 10 -4 (as in previous experiments); and let A; a i g be 
the number of matrix-vector products until our algorithm converges. The performance 
ratio is 

k C g &alg 

k ' 

which has a value of when the two algorithms take the same number of matrix-vector 
products, the value 1 when our algorithm takes matrix-vector products, and the value 
-1 (or -2) when our algorithm takes twice (or thrice) as many matrix-vector products 
as CG. We display the results as a box-plot of the results from all trials. There was no 
systematic difference in the results between the two types of vertex pairs (random or 
degree correlated). 

These results show that the small sample in the previous section is fairly representative 
of the overall performance difference. In general, our commute time algorithm uses fewer 
matrix-vector products than conjugate gradient. We suspect this result is due to the 
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ability to stop early as explained in Section 6.2. And, as also observed in Section 6.3, 
our pairwise Katz algorithm tends to take 2-3 times as many matrix vector products as 
conjugate gradient. These results again used the same "hard alpha" value. 

6.5 Column-wise commute times 

Our next set of results concerns the precision of our approximation to the column-wise 
commute time scores. Because the output of our column- wise commute time algorithm is 
based on a coarse approximation of the diagonal elements of the inverse, we do not expect 
these scores to converge to their exact values as we increase the work in the algorithm. 
Consequently, we study the results in terms of the precision at k measure. Recall that 
the motivation for studying these column-wise measures is not to get the column scores 
precisely correct, but rather to identify the closest nodes to a given query or target node. 
That is, we are most interested in the smallest elements of a column of the commute 
time matrix. Given a target node i, let S^ ls be the k closest nodes to i in terms of our 
algorithm. Also, let St be the k closest nodes to i in terms of the exact commute time. 
(See below for how we compute this set.) The precision at k measure is 

\s* k ns^\/k. 

In words, this formula computes the fraction of the true set of k nodes that our algorithm 
identifies. 

We ran the algorithm from Section 5.1 with a tolerance of 10 -16 to evaluate the 
maximum accuracy possible with this approach. We choose 50 target nodes randomly 
and also based on the same degree sequence sampling mentioned in the last section. For 
values of k between 5 and 100, we show a box-plot of the precision at k scores for four 
networks in Figure 4. In the same figure, we also show the result of using the heuristic 
dj ss -jj— + jj— suggested by von Luxburg et al. [2010]. This heuristic is called "inverse 
degree" in the figure, because it shows that the set S k should look like the set of k nodes 
with highest degree or smallest inverse degree. 

These results show that our approach for estimating a column of the commute time 
matrix provides only partial information about the true set. However, these experiments 
reinforce the theoretical discussion in von Luxburg et al. [2010] that commute time 
provides little information beyond the degree distribution. Consequently, the results 
from our algorithm may provide more useful information in practice. Although such a 
conclusion would require us to formalize the nature of the approximation error in this 
algorithm, and involve a rather different kind of study. 

Exact commute times Computing commute times is challenging. As part of a sep- 
arate project, the third author of this paper wrote a program to compute the exact 
eigenvalue decomposition of a combinatorial graph Laplacian in a distributed computing 
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Figure 4 - Precision at A; for the column-wise commute time approximations (top) over a few 
hundred trial columns. Precision at k for the inverse degree heuristic (bottom) over the same 
columns. These figures show standard box-plots of the result for each column. 

environment using the MPI and the ScaLAPACK library [Blackford et al., 1996]. This 
program ignores the sparsity in the matrix and treats the problem as a dense matrix. We 
adapted this software to compute the pseudo-inverse of the graph Laplacian as well as 
the commute times. We were able to run this code on graphs up to 100,000 nodes using 
approximately 10-20 nodes of a larger supercomputer. (The details varied by graph, and 
are not relevant for this paper.) For graphs with less than 20,000 nodes, the same pro- 
gram will compute all commute-times on the previously mentioned desktop computer. 
Thus, we computed the exact commute times for all graphs except email-euAll, fhckr2, 
and hollywood-2009. 

6.6 Column-wise Katz scores 

We now come to evaluate the local algorithm for Katz scores. As with the pairwise 
algorithms, we first study the empirical convergence of the algorithm. However, the 
evaluation for the convergence here is rather different. Recall, again, that the point of 
the column-wise algorithms is to find the most closely related nodes. For Katz scores, 
these are the largest elements in a column (whereas for commute time, they were the 
smallest elements in the column). Thus, we again evaluate each algorithm in terms of 
the precision at k for the top-k set generated by our algorithms and the exact top-fc set 
produced by solving the linear system. Natural alternatives are other iterative methods 
and specialized direct methods that exploit sparsity. The latter - including approaches 
such as truncated commute time [Sarkar and Moore, 2007] - are beyond the scope of 
this work, since they require a different computational treatment in terms of caching and 
parallelization. Thus, we again use conjugate gradient (CG) as our point of comparison. 
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The exact solution is computed by solving (J — aA)ki = e*, again using the MINRES 
method, to a tolerance of 10 -12 . 

We also look at the Kendall-r correlation coefficient between our algorithm's results 
and the exact top-A: set. This experiment will let us evaluate whether the algorithm is 
ordering the true set of top-k results correctly. Let xt? be the scores from our algorithm 
on the exact top-fc set, and let x^, be the true top-k scores. The r coefficients are 
computed between xf? and . 

Both of the precision at k and the Kendall-r measures should tend to 1 as we increase 
the work in our algorithm. Indeed, this is what we observe in Figure 5. For these figures, 
we pick a vertex with a fairly large degree and run Algorithm 6 with the "hard alpha" 
value mentioned in previous sections. As the algorithm runs, we track work with respect 
to the number of effective matrix vector products. An effective matrix-vector product 
corresponds to our algorithm examining the same number of edges as a matrix-vector 
product. For example, suppose the algorithm accesses a total of 80 neighbors in a graph 
with 16 edges. Then this instance corresponds to 2.5 effective matrix vector products. 
The idea is that the amount of work in one effective matrix vector product is about the 
same as the amount of work in one iteration of CG. Hence, we can compare algorithms 
on this ground. As evident from the legend in each figure, we look at precision at k for 
four values of fc, 10, 25, 100, 1000, and also the Kendall-r for these same values. While 
all of the measures should tend to 1 as we increase work, some of the exact top-fc results 
contain tied values. Our algorithm has trouble capturing precisely tied values and the 
effect is that our Kendall-r score does not always tend to 1 exactly. 

For comparison, we show results from the conjugate gradient method for the top-25 
set after 2,5,10,15,25, and 50 matrix- vector products. In these results, the top-25 set 
is nearly converged after the equivalent of a single matrix- vector product - equivalent to 
just one iteration of the CG algorithm. The CG algorithm does not provide any useful 
information until it converges. Our top-fc algorithm produces useful partial information 
in much less work. 

6.7 Runtime 

Finally, we show the empirical runtime of our implementations in Tables 4 and 5. Table 4 
describes the runtime of the two pairwise algorithms. We show the 25th, 50th, and 75th 
percentiles of the time taken to compute the results from Figure 3. Our implementation 
is not optimized, and so these results indicate the current real-world performance of the 
algorithms. 

Table 5 describes the runtime of the column-wise Katz algorithm. Here, we picked 
columns of the matrix to approximate in two ways: (i) randomly, and (ii) to sample the 
entire degree distribution. As in previous experiments, we took the 1st, 2nd, 3rd, 4th, 
5th, 10th, 20th,. . .vertices from the set of vertices sorted in order of decreasing degree. 
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Figure 5 - Convergence results for our column-wise Katz algorithm in terms of the precision 
of the top-A; set (left) and the ordering of the true top-A; set (right). See Section 6.6 for the 
discussion. 
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For each column picked in this manner, we ran Algorithm 6 and recorded the wall clock 
time. The 25th, 50th, and 75th percentiles of these times are shown in the table for each 
of the two sets of vertices. 

For this algorithm, the degree of the target node has a considerable impact on the 
algorithm runtime. This effect is particularly evident in the flick2 data. The randomly 
chosen columns are found almost instantly, whereas the degree sampled columns take 
considerably longer. A potential explanation for this behavior is that starting at a vertex 
with a large degree will dramatically increase the residual at the first step. If these new 
vertices also have a large degree, then this effect will multiply and the residual will rise 
for a long time before converging. Even in the cases where the algorithm took a long 
time to converge, it only explored a small fraction of the graph (usually about 10% of the 
vertices), and so it retained its locality property. This property suggests that optimizing 
our implementation could reduce these runtimes. 

7 Conclusion and Discussion 

The goal of this manuscript is to estimate commute times and Katz scores in a rapid 
fashion. Let us summarize our contributions and experimental findings. 

• For the pair-wise commute time problem, we have implemented Algorithm 4, based 
on the relationship between the Lanczos process and a quadrature rule (Section 4.1). 
This algorithm uses a similar mechanism to that of conjugate gradient (CG). It 
outperforms the latter in terms of total matrix- vector products, because it provides 
upper and lower bounds that allow for early termination, whereas CG does not 
provide an easy way of detecting convergence for a specific pairwise score. 

• For the pair-wize Katz problem, we have proposed Algorithm 5, also based on the 
same quadrature theory. This algorithm involves two simultaneous Lanczos iter- 
ations. In practice, this means more work per iteration than a simple approach 
based on CG. A careful implementation of Algorithm 5 would merge the two Lanc- 
zos processes into a "joint process" and perform the matrix-vector products simul- 
taneously. In our tests of this idea, we have found that the combined matrix- vector 
product took only 1.5 times as long as a single matrix- vector product. 

• For the column-wise commute time problem, we have investigated a variation of 
the conjugate gradient method that also provides an estimate of the diagonals 
of the matrix inverse. We have found that these estimates were fairly crude ap- 
proximations of the commute time scores. We have also investigated whether the 
degree-based heuristic from von Luxburg ct al. [2010] provides better information. 
It indeed seems to perform better, which suggests that the smallest elements of a 
column of the commute-time matrix may not be a useful set of useful related nodes. 

• For the column-wise Katz algorithm, we have proposed Algorithm 6 based on the 
techniques used for personalized PageRank computing. The idea with these tech- 
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Table 4 - Runtime of the pair-wise algorithms. The "0.0" second entries are rounded down 
for display. These are really just less than 0.1 seconds. The three columns for each type of 
problem show the 25th, 50th, and 75th percentiles of the wall-clock time to compute the results 
in Figure 3. 
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niques is to exploit sparsity in the solution vector itself to derive faster algorithms. 

We have shown that this algorithm converged in two cases: Remark 1, where we 

established a precise convergence result, and Theorem 3, where we only established 

asymptotic convergence. 
We believe that these results paint a useful picture of the strengths and limitations of 
our algorithms. Here are a few possible directions for future work: 

Alternatives for pair-wise Katz. First, there are alternatives to using the identity 
u T /(Z)v = (1/4) (u + v) T /(Z)(u + v) - (l/4)(u - v) T /(Z)(u - v) in the u ^ v 
case. The first alternative is based on the nonsymmetric Lanczos process [Golub and 
Meurant, 2010]. This approach still requires two matrix- vector products per iteration, 
but it directly estimates the bilinear form and also provides upper and lower bounds. 
A concern with the nonsymmetric Lanczos process is that it is possible to encounter 
degeneracies in the recurrence relationships that stop the process short of convergence. 
Another alternative is based on the block Lanczos process [Golub and Meurant, 2010]. 
However, this process does not yet offer upper and lower bounds. 

A theoretical basis for the localization of Katz scores. The inspiration for the 
column-wise Katz algorithm were the highly successful personalized PageRank algo- 
rithms. The localization of these personalized PageRank vectors was made precise in 
a theorem from Andersen et al. [2006] that related the personalized PageRank vector to 
cuts in the graph. In brief, if there is a good cut nearby a vertex, then the personalized 
PageRank vector will be localized on a few vertices. An interesting question is whether 
or not Katz matrices enjoy a similar property. We hope to investigate this question in 
the future. 
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